Craftcans.com: scraping with BeautifulSoup

This notebook explains how to get the same data using the BeautifulSoup package instead of pandas.



In [1]:

    
import requests, pandas
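# BeautifulSoup 4 is distributed as the bs4 package; the old BeautifulSoup 3 module is no longer maintained
from bs4 import BeautifulSoup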



In [2]:

    
url = "http://craftcans.com/db.php?search=all&sort=beerid&ord=desc&view=text"

Option 3: BeautifulSoup



In [3]:

    
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)

If you open the website and use the Inspect Element feature of Google Chrome, you can see that this table has no class or ID, but it does have a style attribute with the value width:100%;margin-top:10px;. We can use that value to identify the correct table on the page.



In [4]:

    
table = soup.find("table",attrs={"style":"width:100%;margin-top:10px;"})
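To see how this lookup behaves without requesting the live page, here is a minimal sketch on a hand-written HTML snippet. The snippet and the names `sample_html`, `sample_soup`, `sample_table` and `missing` are made up for illustration; only the style value comes from the page described above. It assumes the bs4 package and its built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, only one with the style we want
sample_html = """
<table><tr><td>navigation</td></tr></table>
<table style="width:100%;margin-top:10px;">
  <tr><td>ENTRY</td><td><b>BEER</b></td></tr>
  <tr><td>1.</td><td><b>Some Beer</b></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
sample_table = sample_soup.find("table", attrs={"style": "width:100%;margin-top:10px;"})
missing = sample_soup.find("table", attrs={"style": "no-such-style"})

print(sample_table is not None)          # the styled table was found
print(missing is None)                   # no table carries the other style
print(len(sample_table.find_all("tr")))  # number of rows in the matched table
```

The attribute is compared as an exact string, so if the site ever changes the spacing or order inside the style value, find() would return None and the lookup would need to be adjusted.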

Now that we have found the table, we need to go row by row, read all the columns of each row, and save the text inside. Let's save each row as a dictionary and append all the dictionaries to a list (which gives us JSON-ready data). Note that the BEER column is a bit different: the value inside the table cell is in bold (i.e. wrapped in a <b> tag). Thus we should first find the <b> tag and only then read its text content.



In [5]:

    
# find all the rows of the table and save them into the rows variable
rows = table.findAll("tr")
# create an empty list to be filled in with dictionaries
data_list = []
# for each row in the list of rows:
for row in rows:
    columns = row.findAll("td") # find all columns in that row
    # and create a dictionary, where we give the key and get the text content as value
    beer = {
        "id":columns[0].text,
        "beer":columns[1].find('b').text,
        "brewery":columns[2].text,
        "location":columns[3].text,
        "style":columns[4].text,
        "size":columns[5].text,
        "abv":columns[6].text,
        "ibu":columns[7].text
    }
    # append the dictionary to the list
    data_list.append(beer)
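The loop above treats every row the same way, so the header row ends up in the list too, and a row with fewer than eight cells would raise an IndexError. Whether the real page contains such rows is not something this notebook checks, so the following is only a sketch of a more defensive variant. It runs on a made-up three-column table (`sample_html`, `keys` and `records` are illustrative names) and assumes the bs4 package.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table: a header row, one data row, one spacer row
sample_html = """
<table>
  <tr><td><b>ENTRY</b></td><td><b>BEER</b></td><td><b>BREWERY</b></td></tr>
  <tr><td>1.</td><td><b>Some Beer</b></td><td>Some Brewery</td></tr>
  <tr><td colspan="3"></td></tr>
</table>
"""

keys = ["id", "beer", "brewery"]  # shortened column list for the sketch
sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

records = []
for row in sample_rows[1:]:        # skip the header row
    cells = row.find_all("td")
    if len(cells) < len(keys):     # ignore rows that do not have every column
        continue
    # get_text(strip=True) also reads text nested inside tags such as <b>
    records.append({key: cell.get_text(strip=True) for key, cell in zip(keys, cells)})

print(records)
```

Using get_text() on the whole cell avoids the separate lookup of the <b> tag, at the cost of also picking up any other text that might sit in the same cell.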

Let's see the result. The first 5 dictionaries should be enough.



In [7]:

    
data_list[:5]









    Out[7]:





[{'abv': u'ABV',
  'beer': u'BEER',
  'brewery': u'BREWERY',
  'ibu': u'IBUs',
  'id': u'ENTRY',
  'location': u'LOCATION',
  'size': u'SIZE',
  'style': u'STYLE'},
 {'abv': u'4.5%',
  'beer': u'Get Together',
  'brewery': u'NorthGate Brewing',
  'ibu': u'50',
  'id': u'2692.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'American IPA'},
 {'abv': u'4.9%',
  'beer': u"Maggie's Leap",
  'brewery': u'NorthGate Brewing',
  'ibu': u'26',
  'id': u'2691.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'Milk / Sweet Stout'},
 {'abv': u'4.8%',
  'beer': u"Wall's End",
  'brewery': u'NorthGate Brewing',
  'ibu': u'19',
  'id': u'2690.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'English Brown Ale'},
 {'abv': u'6.0%',
  'beer': u'Pumpion',
  'brewery': u'NorthGate Brewing',
  'ibu': u'38',
  'id': u'2689.',
  'location': u'Minneapolis,MN',
  'size': u'16 oz.',
  'style': u'Pumpkin Ale'}]
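All the values above are strings, and the first dictionary is just the header row. If numbers are needed later, a small cleaning step can help. The sketch below uses two records copied from the output above; `sample`, `clean_record` and `cleaned_list` are hypothetical names, and the helper assumes that every id, abv and ibu field is filled in (an empty or non-numeric value would raise a ValueError and need extra handling).

```python
# Two records copied from the output above (header row plus one beer)
sample = [
    {"abv": "ABV", "beer": "BEER", "brewery": "BREWERY", "ibu": "IBUs",
     "id": "ENTRY", "location": "LOCATION", "size": "SIZE", "style": "STYLE"},
    {"abv": "4.5%", "beer": "Get Together", "brewery": "NorthGate Brewing", "ibu": "50",
     "id": "2692.", "location": "Minneapolis,MN", "size": "16 oz.", "style": "American IPA"},
]

def clean_record(record):
    """Return a copy with id, abv and ibu converted to numbers (hypothetical helper)."""
    cleaned = dict(record)
    cleaned["id"] = int(record["id"].rstrip("."))       # "2692." -> 2692
    cleaned["abv"] = float(record["abv"].rstrip("%"))   # "4.5%"  -> 4.5
    cleaned["ibu"] = int(record["ibu"])                 # "50"    -> 50
    return cleaned

# sample[1:] drops the header row before cleaning
cleaned_list = [clean_record(r) for r in sample[1:]]
print(cleaned_list[0]["id"], cleaned_list[0]["abv"], cleaned_list[0]["ibu"])  # 2692 4.5 50
```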

If you are more comfortable working with DataFrames, the conversion can easily be done.



In [8]:

    
data = pandas.DataFrame(data_list)



In [9]:

    
data.head()









    Out[9]:






  
    
      
    abv           beer            brewery   ibu     id        location    size               style
0   ABV           BEER            BREWERY  IBUs  ENTRY        LOCATION    SIZE               STYLE
1  4.5%   Get Together  NorthGate Brewing    50  2692.  Minneapolis,MN  16 oz.        American IPA
2  4.9%  Maggie's Leap  NorthGate Brewing    26  2691.  Minneapolis,MN  16 oz.  Milk / Sweet Stout
3  4.8%     Wall's End  NorthGate Brewing    19  2690.  Minneapolis,MN  16 oz.   English Brown Ale
4  6.0%        Pumpion  NorthGate Brewing    38  2689.  Minneapolis,MN  16 oz.         Pumpkin Ale
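Row 0 of the DataFrame is the table's header row rather than a beer, and every column holds strings. One possible way to tidy this up in pandas is sketched below on three records copied from the output above. The names `sample` and `frame` are illustrative, and `errors="coerce"` is a choice made for this sketch: it turns values that cannot be parsed into NaN instead of raising.

```python
import pandas

# Header row plus two beers, copied from the output above
sample = [
    {"abv": "ABV", "beer": "BEER", "ibu": "IBUs", "id": "ENTRY"},
    {"abv": "4.5%", "beer": "Get Together", "ibu": "50", "id": "2692."},
    {"abv": "4.9%", "beer": "Maggie's Leap", "ibu": "26", "id": "2691."},
]

frame = pandas.DataFrame(sample)

# drop the header row and renumber the index from 0
frame = frame.iloc[1:].reset_index(drop=True)

# strip the percent sign, then convert the text columns to numbers
frame["abv"] = pandas.to_numeric(frame["abv"].str.rstrip("%"), errors="coerce")
frame["ibu"] = pandas.to_numeric(frame["ibu"], errors="coerce")

print(frame)
```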

This time, let's save the resulting data to a JSON file.



In [10]:

    
import json
with open("craftcans.json","w") as f:
    json.dump(data_list,f,sort_keys = True, indent = 4)
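To check that a file written this way can be read back unchanged, a round trip with json.load is one option. The sketch below is written for Python 3 and uses a one-record stand-in list and a temporary directory, so it does not touch craftcans.json; `sample`, `path` and `loaded` are illustrative names.

```python
import json
import os
import tempfile

sample = [{"id": "2692.", "beer": "Get Together"}]  # stand-in for data_list

# write to a temporary directory so the sketch leaves no file behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "craftcans_sample.json")
    with open(path, "w") as f:
        json.dump(sample, f, sort_keys=True, indent=4)
    with open(path) as f:
        loaded = json.load(f)

# the list of dictionaries survives the round trip unchanged
print(loaded == sample)
```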
